Code
old_theme <- theme_set(theme_minimal())M1 MIDS/MFA/LOGOS |
| Année 2025 |
The aim of this lab is to showcase the capabilities of dplyr to handle tables. In R, tables are basically handled as data frames. Provides a lot of tools to handle data frames. dplyr should be thought as a concise and consistent API for performing table calculus (the so-called relational algebra from database theory). dplyr offers verbs that implement the main operations of relational algebra: projection, selection, joins, grouping, and aggregation.
We will use the following packages. If needed, we install them.
old_theme <- theme_set(theme_minimal())Check nycflights13 for any explanation concerning the tables and their columns.
We will use data from package nycflights13. Those data consist of five tables.
The five tables have been loaded as 5 data frames. They could also be considered as tables. Living in an in memory relational database operated by SQLite.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_lite <- copy_to(con, nycflights13::flights)
airports_lite <- copy_to(con, nycflights13::airports)
planes_lite <- copy_to(con, nycflights13::planes)
weather_lite <- copy_to(con, nycflights13::weather)
airlines_lite <- copy_to(con, nycflights13::airlines)This allows us to handle the tables/data frames as tables living in a plain database. If you are familiar with SQL, this is well-known territory.
flights_lite |>
select(contains("delay")) |>
show_query()<SQL>
SELECT `dep_delay`, `arr_delay`
FROM `nycflights13::flights`
Anyway you can view the data in spreadsheet style.
View(flights)Ask for help about table flights (use head and glimpse)
dplyr way)Find all flights that:
Hint: use verb filter.
IAH or HOU)Package stringr could be useful.
airlines |>
filter(stringr::str_starts(name, "United") |
stringr::str_starts(name, "American") |
stringr::str_starts(name, "Delta"))# A tibble: 3 × 2
carrier name
<chr> <chr>
1 AA American Airlines Inc.
2 DL Delta Air Lines Inc.
3 UA United Air Lines Inc.
airlines |>
filter(stringr::str_detect(name, ("United|American|Delta"))) |>
pull(carrier)[1] "AA" "DL" "UA"
Unfortunately, the next code is not supported.
#| eval: false
airlines_lite |>
filter(stringr::str_starts(name, "United") |
stringr::str_starts(name, "American") |
stringr::str_starts(name, "Delta")) |>
show_query()It would be the dplyr version of the following SQL code.
SELECT *
FROM `nycflights13::airlines`
WHERE "name" LIKE 'United%' OR
"name" LIKE 'American%' OR
"name" LIKE 'Delta%' ;stringr is part of tidyverse. It offers an API for string manipulations.
When manipulating temporal information (date, time, duration), keep an eye on what lubridate offers. The API closely parallels what Relational Database Management Systems (RDMS) and Python (datetime) offer.
origin have a missing dep_time?The introduction to tidyselect is a must read.
flight report the number of missing values.arrange() to sort all missing values to the start? (Hint: use is.na()).dep_delay)The database provides all we need with columns distance and air_time. Otherwise, with the positions of airports from table airports, we should be able to compute distances using :
‘Haversine’ formula.
dep_time, dep_delay, arr_time, and arr_delay from flights.
select() call?any_of() function do? Why might it be helpful in conjunction with this vector?vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(
flights,
contains("TIME", ignore.case =TRUE)) |>
head()# A tibble: 6 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?Carefully read the documentation for min_rank().
# A tibble: 105 × 2
dest n_cancelled
<chr> <int>
1 ABQ 0
2 ACK 0
3 ALB 20
4 ANC 0
5 ATL 317
6 AUS 21
7 AVL 12
8 BDL 31
9 BGR 15
10 BHM 25
# ℹ 95 more rows
Warning: Missing values are always removed in SQL aggregation functions.
Use `na.rm = TRUE` to silence this warning
This warning is displayed once every 8 hours.
<SQL>
SELECT `dest`, SUM((`dep_time` IS NULL)) AS `n_cancelled`
FROM `nycflights13::flights`
GROUP BY `dest`
not_cancelled |> count(dest) and (without usingcount()`).(is.na(dep_delay) | is.na(arr_delay) ) is slightly suboptimal. Why? Which is the most important column?Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarise(n()))
sort argument to count() do. When might you use it?carriers serve all destination airports (in the table) ?tailnum) has the worst on-time record amongst planes with at least ten flights?Using dplyr, it is easy. See A second look at group_by
lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.lag() is an example of window function. If we were using SQL, we would define a WINDOW using an expression like
WINDOW w As (PARTITION BY origin ORDER BY year, month, day, sched_dep_time)Something still needs fixing here: some flights never took off (is.na(dep_time)). Should they be sided out? assigned an infinite departure delay?
Consider all flights with average speed above \(950\text{km/h}\) as suspicious.
Let us visualize destinations and origins of the speedy flights.
Assume a plane is characterized by tailnum. Some flights have no tailnum. We ignore them.